Disambiguation of Single Noun Translations Extracted from Bilingual Comparable Corpora
Abstract
... abstracts of papers of four academic societies published in Japan, namely the Japan Architecture Society (JAS), the Institute of Electric Engineering (IEE), the Institute of Electronics and Communication Engineering (IECE), and the Information Processing Society of Japan (IPSJ). The number of abstracts in each of these corpora is shown in Table 1. Parts of these bilingual corpora are parallel. The proportion of parallel text in each of the four corpora is also shown in Table 1, where the "parallel texts ratio" is defined as follows.

Parallel texts ratio = √(ParaJ × ParaE)

where
ParaJ = (Number of parallel abstracts) / (Number of all Japanese abstracts)
ParaE = (Number of parallel abstracts) / (Number of all English abstracts)

Table 1: Corpora used in our experiment and parallel texts ratio

Academic society   Number of Japanese articles   Number of English articles   Parallel texts ratio
JAS                55,715                        50,236                       0.95
IEE                18,008                        399                          0.12
IECE               86,346                        33,076                       0.59
IPSJ               26,815                        11,860                       0.65

5.3. Morphological Analysis and POS Tagging

To extract terms, we use the morphological analyzer Chasen (Matsumoto 1997) for the Japanese corpora and Brill's tagger (Brill 1994) for the English corpora.

For the Japanese corpora, Chasen segments each sentence into words and at the same time assigns a part-of-speech tag (henceforth "POS tag") to each word. We then extract every uninterrupted sequence of nouns as a candidate term. By construction, each candidate is either a single noun or a compound noun.

For the English corpora, Brill's tagger only assigns a POS tag to each word, since English words are already delimited. Using the assigned POS tags, we then extract as candidate terms the following four types of linguistic patterns, written as Perl regular expressions, and their combinations.

(1) [noun]
(2) adjective (noun | adjective) noun
(3) noun "of" noun
(4) foreign word
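The two extraction steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Perl implementation: the input is assumed to be a list of (word, tag) pairs already produced by a tagger, the English tags are assumed to follow the Penn Treebank set, Japanese noun tags are assumed to start with 名詞, and the four patterns are applied exactly as listed, trying longer patterns first and ignoring the unspecified "combinations". The function names `noun_runs` and `english_candidates` are hypothetical.

```python
from itertools import groupby

# Assumed Penn Treebank tags; the source does not name the tag set.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}
FOREIGN_TAGS = {"FW"}

# The four patterns as listed, longest first. A tuple inside a pattern
# means "any one of these symbols" (the "(noun | adjective)" slot).
PATTERNS = [
    ("adjective", ("noun", "adjective"), "noun"),  # pattern (2)
    ("noun", "of", "noun"),                        # pattern (3)
    ("noun",),                                     # pattern (1)
    ("foreign",),                                  # pattern (4)
]


def noun_runs(tagged, is_noun):
    """Japanese-side step: every maximal uninterrupted run of nouns."""
    runs = []
    for flag, group in groupby(tagged, key=lambda wt: is_noun(wt[1])):
        if flag:
            runs.append([word for word, _ in group])
    return runs


def coarse(word, tag):
    """Map a (word, tag) pair to the coarse symbol used in PATTERNS."""
    if word.lower() == "of":
        return "of"
    if tag in NOUN_TAGS:
        return "noun"
    if tag in ADJ_TAGS:
        return "adjective"
    if tag in FOREIGN_TAGS:
        return "foreign"
    return "other"


def _fits(symbol, slot):
    return symbol in slot if isinstance(slot, tuple) else symbol == slot


def english_candidates(tagged):
    """English-side step: scan left to right, taking the first pattern that fits."""
    symbols = [coarse(w, t) for w, t in tagged]
    words = [w for w, _ in tagged]
    found, i = [], 0
    while i < len(symbols):
        for pattern in PATTERNS:
            window = symbols[i:i + len(pattern)]
            if len(window) == len(pattern) and all(
                _fits(s, p) for s, p in zip(window, pattern)
            ):
                found.append(" ".join(words[i:i + len(pattern)]))
                i += len(pattern)
                break
        else:
            i += 1  # no pattern starts here; move on
    return found


def is_japanese_noun(tag):
    # Assumption: noun tags begin with 名詞 (e.g. 名詞-一般).
    return tag.startswith("名詞")


japanese = [("機械", "名詞-一般"), ("翻訳", "名詞-サ変接続"),
            ("の", "助詞"), ("辞書", "名詞-一般")]
english = [("statistical", "JJ"), ("machine", "NN"), ("translation", "NN"),
           ("of", "IN"), ("abstracts", "NNS"), ("uses", "VBZ"),
           ("a", "DT"), ("dictionary", "NN")]

print(noun_runs(japanese, is_japanese_noun))
print(english_candidates(english))
```

On the sample input, the Japanese side yields the compound noun 機械 翻訳 and the single noun 辞書, and the English side yields "statistical machine translation", "abstracts", and "dictionary". The greedy left-to-right scan is a design choice of this sketch; a different matching order would change which overlapping candidates are kept.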
Similar resources
Disambiguation of Compound Noun Translations Extracted from Bilingual Comparable Corpora
Bilingual machine readable dictionaries are important and indispensable information resources for cross-language information retrieval, machine translation, and so on. In this paper, we describe a bilingual dictionary acquisition system which extracts translations from non-parallel but comparable corpora of a specific academic domain and disambiguates the extracted translations. We also experim...
Disambiguation of Lexical Translations Based on Bilingual Comparable Corpora
Bilingual dictionaries in machine-readable form are important and indispensable information resources for cross-language information retrieval (CLIR), machine translation (MT), and so on. Specific academic areas or technology fields become the focus of these cross-language informational activities. In this paper, we describe a bilingual dictionary acquisition system which extracts translations from ...
Utilizing Clues in Syntactic Relationship for Automatic Target Word Sense Disambiguation
Multiple translations to the target language are due to several meanings of source words and various target word equivalents, depending on the context of the source word. Thus, an automated approach is presented for resolving target-word selection, based on “word-to-sense” and “sense-to-word” relationship between source words and their translations, using syntactic relationships (subject-verb, ...
Exploiting Parallel Corpora for Supervised Word-Sense Disambiguation in English-Hungarian Machine Translation
In this paper we present an experiment to automatically generate annotated training corpora for a supervised word sense disambiguation module operating in an English-Hungarian and a Hungarian-English machine translation system. Training examples for the WSD module are produced by annotating ambiguous lexical items in the source language (words having several possible translations) with their pr...
Bilingual Terminology Acquisition from Comparable Corpora and Phrasal Translation to Cross-Language Information Retrieval
The present paper will seek to present an approach to bilingual lexicon extraction from non-aligned comparable corpora, phrasal translation as well as evaluations on Cross-Language Information Retrieval. A two-stages translation model is proposed for the acquisition of bilingual terminology from comparable corpora, disambiguation and selection of best translation alternatives according to their...